What differences can we spot in the words used in Simple English Wikipedia (simple.wikipedia.org) vs. English Wikipedia (en.wikipedia.org) for the same pages within a topic?

There were 264 pages in Simple Wikipedia's mathematics category, and of these, 255 appeared in English Wikipedia. These pages were downloaded using the special exporter: http://simple.wikipedia.org/wiki/Special%3aExport.

After removing category pages and disambiguation pages, and then dropping any page that no longer appeared in both wikis (see extract_wiki_text.py), we ended up with 229 articles in each.
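
The filtering itself lives in extract_wiki_text.py, which is not reproduced here. A minimal sketch of the idea, assuming each wiki has already been parsed into a dictionary mapping titles to text, might look like the following. The helper name `shared_articles` and the title-based checks for category and disambiguation pages are illustrative assumptions, not the script's actual logic.

```python
def shared_articles(simple_pages, english_pages):
    """Keep only real articles whose titles exist in both wikis.

    Hypothetical helper: the title checks below are assumptions about
    how category and disambiguation pages could be recognised.
    """
    def is_article(title):
        return (not title.startswith('Category:')
                and not title.endswith('(disambiguation)'))

    titles = [t for t in simple_pages if t in english_pages and is_article(t)]
    simple = {t: simple_pages[t] for t in titles}
    english = {t: english_pages[t] for t in titles}
    return simple, english


# Toy input, not real Wikipedia data
simple_raw = {'Ratio': ['text'], 'Category:Mathematics': ['text'], 'Pi Day': ['text']}
english_raw = {'Ratio': ['text'], 'Category:Mathematics': ['text'], 'Manifold': ['text']}
sd_toy, md_toy = shared_articles(simple_raw, english_raw)
print(sorted(sd_toy))
```

Filtering both dictionaries against the same title list guarantees that the two corpora end up with identical keys, which the comparisons below rely on.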


In [2]:
%matplotlib inline

import nltk
from collections import defaultdict
import pandas as pd

In [3]:
import cPickle as pickle

# Load pickle files generated by extract_wiki_text.py. That script transforms MediaWiki XML into dictionaries mapping article
# titles to article text, cleaning up MediaWiki markup with helper functions from https://github.com/bwbaugh/wikipedia-extractor/.
# Pickle files are binary, so open them in 'rb' mode.
with open('simple.p', 'rb') as f:
    sd = pickle.load(f)

with open('math.p', 'rb') as f:
    md = pickle.load(f)

In [4]:
# These are the pages that appeared in both Simple and English Wikipedia.
sd.keys()


Out[4]:
['Heuristic',
 'History of mathematics',
 'Ratio',
 'Piecewise',
 'Number line',
 'Associativity',
 'Module (mathematics)',
 'Function space',
 'Formal language',
 'Theorem',
 'Random variable',
 'Monster group',
 'Axiom',
 'Group theory',
 'Least common multiple',
 'Planck time',
 'Magnitude (mathematics)',
 u'G\xf6del number',
 'Equivalence relation',
 'Pigeonhole principle',
 'American Mathematical Society',
 'Binomial expansion',
 'Gradient',
 'Logarithmic scale',
 'Exponential function',
 'Logarithm',
 'Field (mathematics)',
 'Absolute value',
 'Law of sines',
 'Odds',
 'Distribution (mathematics)',
 '2D',
 'Immersion (mathematics)',
 u'Banach\u2013Tarski paradox',
 'Square root',
 'Manifold',
 'Theory of computation',
 'Estimation',
 'Transitivity (mathematics)',
 'Empty string',
 'Big O notation',
 'Lambda calculus',
 'Surface area',
 'Map coloring',
 'Exponentiation',
 'Coordinate system',
 'Random',
 'Linear equation',
 'Unit circle',
 'Real analysis',
 'Rounding',
 'Pi Day',
 'Currying',
 'Function composition',
 'Geometric topology',
 'Decision theory',
 "Spearman's rank correlation coefficient",
 'Structured program theorem',
 'Limit (mathematics)',
 'Alphabet (computer science)',
 'Counting',
 'Entscheidungsproblem',
 'Algorithmic information theory',
 'Order theory',
 'Dynamical systems theory',
 'Markov chain',
 'Floating point',
 u"G\xf6del's incompleteness theorems",
 'Binary operation',
 'On-Line Encyclopedia of Integer Sequences',
 'Formula',
 'Percentage',
 'Y-intercept',
 'Parameter',
 'International Mathematical Olympiad',
 '3D',
 'Identity (mathematics)',
 'Hyperplane arrangements',
 'Nth root',
 'Random walk',
 'X-intercept',
 'Probability theory',
 'Mathematical model',
 'Strict weak ordering',
 'Proportionality',
 'Base (mathematics)',
 'Average',
 "Hilbert's paradox of the Grand Hotel",
 'Continuous function',
 'Applied mathematics',
 'Wheel theory',
 'Halting problem',
 'Epigraph (mathematics)',
 'Permutation',
 'Automaton',
 'Homotopy',
 "Hilbert's problems",
 'Imaginary unit',
 'Mental calculation',
 'Reductio ad absurdum',
 'Inverse function',
 'Probability',
 'Inequality',
 'Algebraic variety',
 'Idempotence',
 'Right-hand rule',
 'Stability',
 "Pascal's simplex",
 'Predicate logic',
 'Mathematical physics',
 'Canonical form',
 'Monty Hall problem',
 'Eye of Horus',
 'Pseudovector',
 'Reed-Solomon error correction',
 'Random sampling',
 'Category theory',
 'Exponent',
 'Combinatorics',
 'Graph',
 'Correlation',
 'Lorenz attractor',
 'Interval (mathematics)',
 'Postulate',
 'Cryptanalysis',
 'Dependent and independent variables',
 'Power series',
 'Logarithmic spiral',
 'Riemann sphere',
 'Infinity',
 'Direct proof',
 'Mathematical induction',
 'Factorial',
 'Mathematical proof',
 'Algebraic geometry',
 'Discriminant',
 'Problem',
 'Fundamental theorem of algebra',
 'Equality (mathematics)',
 'Strahler number',
 'Corollary',
 'Coprime',
 "Bayes' theorem",
 u'Hindu\u2013Arabic numeral system',
 'Kepler conjecture',
 "Pascal's Triangle",
 'Euler characteristic',
 'Topology',
 'Closure (mathematics)',
 'Bayesian network',
 'Projection (mathematics)',
 "Euler's identity",
 'Weighted average',
 'Identity element',
 'Numerical analysis',
 'Whitney embedding theorem',
 'Degree (mathematics)',
 '0.999...',
 'Probability space',
 'Discrete mathematics',
 'Infinite monkey theorem',
 'Reflexive relation',
 'Order of magnitude',
 'Mutual information',
 'Combination (mathematics)',
 'Chinese postman problem',
 'Proportions',
 'Population genetics',
 'Gamma function',
 'Decimal separator',
 "Zeno's paradoxes",
 'Daubechies wavelet',
 u'Navier\u2013Stokes equations',
 'Numerical digit',
 "Graham's number",
 'Spiral',
 'Argument',
 'Norm (mathematics)',
 'Chaos theory',
 'Topological space',
 'Gauss-Bonnet theorem',
 'Mathematics',
 'Pythagorean triple',
 'Fields Medal',
 'Signed number representations',
 'Periodic function',
 'Limit of a sequence',
 'Power of two',
 'Chain rule',
 'Fraction (mathematics)',
 'Modulo operation',
 'Cellular automaton',
 'Limit of a function',
 'Abacus',
 'Surface area to volume ratio',
 'Distance',
 'Besov space',
 'Reciprocal',
 'Universe of discourse',
 'Relation (mathematics)',
 'Flux',
 'Binary adder',
 'Aleph one',
 'Arithmetic precision',
 '4D',
 'Dimensionless quantity',
 'Aleph null',
 'Trigonometric function',
 'Lemma (mathematics)',
 'Einstein field equations',
 'Charge conjugation',
 'Ordered pair',
 'Sequence',
 'Mathematics Subject Classification',
 'Series',
 'Decimal',
 'Trigonometry',
 'Commutative property',
 'Complexity class',
 'Clay Mathematics Institute',
 'Mediant (mathematics)',
 'Computer Algebra System',
 '4 (number)',
 'Place value',
 'Order of operations',
 'Compass and straightedge construction',
 'Scatter graph',
 'Pure mathematics',
 'Mathematical constant']

Example of Simple Wikipedia article:


In [31]:
# Documents are lists of paragraphs
sd["Ratio"]


Out[31]:
[u'A ratio between two or more quantities is a way of measuring their sizes compared to each other. ',
 u'For example, if a school has 20 teachers and 500 pupils then the ratio of teachers to students is written as 20:500 (read as "20 to 500"). For another example, if a cake mix asks for 100 grams of flour, 300 grams of butter and 25 grams of sugar then the ratio of flour to butter to sugar is written as 100:300:25 (read as "100 to 300 to 25"). ',
 u"Ratios can be simplified. In the school example, there were 20 teachers to 500 pupils. If we divided the children up into equally sized classes then each teacher's class would have 25 pupils. That means that for each teacher there are 25 pupils, i.e. the teacher to pupil ratio is 1:25. Another way to go from the ratio 20:500 to the ratio 1:25 is to simply divide both numbers in 20:500 by 20. Note: The ratio 20:500 is the same as 1:25. They are just two ways of writing the same thing. Just like there are different ways of writing a fraction (for example ), there are many ways of writing a ratio. "]

... vs. English Wikipedia article


In [32]:
md['Ratio']


Out[32]:
[u'In mathematics, a ratio is a relationship between two numbers of the same kind ("e.g.", objects, persons, students, spoonfuls, units of whatever identical dimension), expressed as "a" to "b" or "a":"b", sometimes expressed arithmetically as a dimensionless quotient of the two that explicitly indicates how many times the first number contains the second (not necessarily an integer).',
 u"In layman's terms a ratio represents, for every amount of one thing, how much there is of another thing. For example, supposing one has 8 oranges and 6 lemons in a bowl of fruit, the ratio of oranges to lemons would be 4:3 (which is equivalent to 8:6) while the ratio of lemons to oranges would be 3:4. Additionally, the ratio of oranges to the total amount of fruit is 4:7 (equivalent to 8:14). The 4:7 ratio can be further converted to a fraction of 4/7 to represent how much of the fruit is oranges.",
 u'Notation and terminology.',
 u'The ratio of numbers "A" and "B" can be expressed as:',
 u'The numbers "A" and "B" are sometimes called "terms" with "A" being the "antecedent" and "B" being the "consequent".',
 u'The proportion expressing the equality of the ratios "A":"B" and "C":"D" is written',
 u'"A":"B" = "C":"D" or "A":"B"::"C":"D". This latter form, when spoken or written in the English language, is often expressed as',
 u'"A" is to "B" as "C" is to "D".',
 u'"A", "B", "C" and "D" are called the terms of the proportion. "A" and "D" are called the "extremes", and "B" and "C" are called the "means". The equality of three or more proportions is called a continued proportion.',
 u'Ratios are sometimes used with three or more terms. The ratio of the dimensions of a "two by four" that is ten inches long is 2:4:10. A good concrete mix is sometimes quoted as 1:2:4 for the ratio of cement to sand to gravel.',
 u'For a mixture of 4/1 cement to water, it could be said that the ratio of cement to water is 4:1, that there is 4 times as much cement as water, or that there is a quarter (1/4) as much water as cement..',
 u'Older televisions have a 4:3 "aspect ratio", which means that the width is 4/3 of the height; modern widescreen TVs have a 16:9 aspect ratio.',
 u'History and etymology.',
 u'It is impossible to trace the origin of the "concept" of ratio, because the ideas from which it developed would have been familiar to preliterate cultures. For example, the idea of one village being twice as large as another is so basic that it would have been understood in prehistoric society. However, it is possible to trace the origin of the word "ratio" to the Ancient Greek \u03bb\u03cc\u03b3\u03bf\u03c2 ("logos"). Early translators rendered this into Latin as "ratio" ("reason"; as in the word "rational"). (A rational number may be expressed as the quotient of two integers.) A more modern interpretation of Euclid\'s meaning is more akin to computation or reckoning. Medieval writers used the word "proportio" ("proportion") to indicate ratio and "proportionalitas" ("proportionality") for the equality of ratios.',
 u"Euclid collected the results appearing in the Elements from earlier sources. The Pythagoreans developed a theory of ratio and proportion as applied to numbers. The Pythagoreans' conception of number included only what would today be called rational numbers, casting doubt on the validity of the theory in geometry where, as the Pythagoreans also discovered, incommensurable ratios (corresponding to irrational numbers) exist. The discovery of a theory of ratios that does not assume commensurability is probably due to Eudoxus of Cnidus. The exposition of the theory of proportions that appears in Book VII of The Elements reflects the earlier theory of ratios of commensurables.",
 u'The existence of multiple theories seems unnecessarily complex to modern sensibility since ratios are, to a large extent, identified with quotients. This is a comparatively recent development however, as can be seen from the fact that modern geometry textbooks still use distinct terminology and notation for ratios and quotients. The reasons for this are twofold. First, there was the previously mentioned reluctance to accept irrational numbers as true numbers. Second, the lack of a widely used symbolism to replace the already established terminology of ratios delayed the full acceptance of fractions as alternative until the 16th century.',
 u"Euclid's definitions.",
 u'Book V of Euclid\'s Elements has 18 definitions, all of which relate to ratios. In addition, Euclid uses ideas that were in such common usage that he did not include definitions for them. The first two definitions say that a "part" of a quantity is another quantity that "measures" it and conversely, a "multiple" of a quantity is another quantity that it measures. In modern terminology, this means that a multiple of a quantity is that quantity multiplied by an integer greater than one\u2014and a part of a quantity (meaning aliquot part) is a part that, when multiplied by an integer greater than one, gives the quantity.',
 u'Euclid does not define the term "measure" as used here, However, one may infer that if a quantity is taken as a unit of measurement, and a second quantity is given as an integral number of these units, then the first quantity "measures" the second. Note that these definitions are repeated, nearly word for word, as definitions 3 and 5 in book VII.',
 u'Definition 3 describes what a ratio is in a general way. It is not rigorous in a mathematical sense and some have ascribed it to Euclid\'s editors rather than Euclid himself. Euclid defines a ratio as between two quantities "of the same type", so by this definition the ratios of two lengths or of two areas are defined, but not the ratio of a length and an area. Definition 4 makes this more rigorous. It states that a ratio of two quantities exists when there is a multiple of each that exceeds the other. In modern notation, a ratio exists between quantities "p" and "q" if there exist integers "m" and "n" so that "mp">"q" and "nq">"p". This condition is known as the Archimedean property.',
 u'Definition 5 is the most complex and difficult. It defines what it means for two ratios to be equal. Today, this can be done by simply stating that ratios are equal when the quotients of the terms are equal, but Euclid did not accept the existence of the quotients of incommensurables, so such a definition would have been meaningless to him. Thus, a more subtle definition is needed where quantities involved are not measured directly to one another. Though it may not be possible to assign a rational value to a ratio, it is possible to compare a ratio with a rational number. Specifically, given two quantities, "p" and "q", and a rational number "m"/"n" we can say that the ratio of "p" to "q" is less than, equal to, or greater than "m"/"n" when "np" is less than, equal to, or greater than "mq" respectively. Euclid\'s definition of equality can be stated as that two ratios are equal when they behave identically with respect to being less than, equal to, or greater than any rational number. In modern notation this says that given quantities "p", "q", "r" and "s", then "p":"q"::"r":"s" if for any positive integers "m" and "n", "np"<"mq", "np"="mq", "np">"mq" according as "nr"<"ms", "nr"="ms", "nr">"ms" respectively. There is a remarkable similarity between this definition and the theory of Dedekind cuts used in the modern definition of irrational numbers.',
 u'Definition 6 says that quantities that have the same ratio are "proportional" or "in proportion". Euclid uses the Greek \u1f00\u03bd\u03b1\u03bb\u03cc\u03b3\u03bf\u03bd (analogon), this has the same root as \u03bb\u03cc\u03b3\u03bf\u03c2 and is related to the English word "analog".',
 u'Definition 7 defines what it means for one ratio to be less than or greater than another and is based on the ideas present in definition 5. In modern notation it says that given quantities "p", "q", "r" and "s", then "p":"q">"r":"s" if there are positive integers "m" and "n" so that "np">"mq" and "nr"\u2264"ms".',
 u'As with definition 3, definition 8 is regarded by some as being a later insertion by Euclid\'s editors. It defines three terms "p", "q" and "r" to be in proportion when "p":"q"::"q":"r". This is extended to 4 terms "p", "q", "r" and "s" as "p":"q"::"q":"r"::"r":"s", and so on. Sequences that have the property that the ratios of consecutive terms are equal are called geometric progressions. Definitions 9 and 10 apply this, saying that if "p", "q" and "r" are in proportion then "p":"r" is the "duplicate ratio" of "p":"q" and if "p", "q", "r" and "s" are in proportion then "p":"s" is the "triplicate ratio" of "p":"q". If "p", "q" and "r" are in proportion then "q" is called a "mean proportional" to (or the geometric mean of) "p" and "r". Similarly, if "p", "q", "r" and "s" are in proportion then "q" and "r" are called two mean proportionals to "p" and "s".',
 u'Number of terms and use of fractions.',
 u'In general, a comparison of the quantities of a two-entity ratio can be expressed as a fraction derived from the ratio. For example, in a ratio of 2:3, the amount, size, volume, or quantity of the first entity is formula_2 that of the second entity.',
 u'If there are 2 oranges and 3 apples, the ratio of oranges to apples is 2:3, and the ratio of oranges to the total number of pieces of fruit is 2:5. These ratios can also be expressed in fraction form: there are 2/3 as many oranges as apples, and 2/5 of the pieces of fruit are oranges. If orange juice concentrate is to be diluted with water in the ratio 1:4, then one part of concentrate is mixed with four parts of water, giving five parts total; the amount of orange juice concentrate is 1/4 the amount of water, while the amount of orange juice concentrate is 1/5 of the total liquid. In both ratios and fractions, it is important to be clear what is being compared to what, and beginners often make mistakes for this reason.',
 u'Fractions can also be inferred from ratios with more than two entities; however, a ratio with more than two entities cannot be completely converted into a single fraction, because a fraction can only compare two quantities. A separate fraction can be used to compare the quantities of any two of the entities covered by the ratio: for example, from a ratio of 2:3:7 we can infer that the quantity of the second entity is formula_3 that of the third entity.',
 u'Proportions and percentage ratios.',
 u'If we multiply all quantities involved in a ratio by the same number, the ratio remains valid. For example, a ratio of 3:2 is the same as 12:8. It is usual either to reduce terms to the lowest common denominator, or to express them in parts per hundred (percent).',
 u'If a mixture contains substances A, B, C and D in the ratio 5:9:4:2 then there are 5 parts of A for every 9 parts of B, 4 parts of C and 2 parts of D. As 5+9+4+2=20, the total mixture contains 5/20 of A (5 parts out of 20), 9/20 of B, 4/20 of C, and 2/20 of D. If we divide all numbers by the total and multiply by 100%, we have converted to percentages: 25% A, 45% B, 20% C, and 10% D (equivalent to writing the ratio as 25:45:20:10).',
 u'If the two or more ratio quantities encompass all of the quantities in a particular situation, it is said that "the whole" contains the sum of the parts: for example, a fruit basket containing two apples and three oranges and no other fruit is made up of two parts apples and three parts oranges. In this case, formula_4, or 40% of the whole is apples and formula_5, or 60% of the whole is oranges. This comparison of a specific quantity to "the whole" is called a proportion.',
 u'Reduction.',
 u'Ratios can be reduced (as fractions are) by dividing each quantity by the common factors of all the quantities. As for fractions, the simplest form is considered that in which the numbers in the ratio are the smallest possible integers.',
 u'Thus, the ratio 40:60 is equivalent in meaning to the ratio 2:3, the latter being obtained from the former by dividing both quantities by 20. Mathematically, we write 40:60 = 2:3, or equivalently 40:60::2:3. The verbal equivalent is "40 is to 60 as 2 is to 3."',
 u'A ratio that has integers for both quantities and that cannot be reduced any further (using integers) is said to be in simplest form or lowest terms.',
 u'Sometimes it is useful to write a ratio in the form 1:"x" or "x":1, where "x" is not necessarily an integer, to enable comparisons of different ratios. For example, the ratio 4:5 can be written as 1:1.25 (dividing both sides by 4) Alternatively, it can be written as 0.8:1 (dividing both sides by 5).',
 u'Where the context makes the meaning clear, a ratio in this form is sometimes written without the 1 and the colon, though, mathematically, this makes it a factor or multiplier.',
 u'Dilution ratio.',
 u'Ratios are often used for simple dilutions applied in chemistry and biology. A simple dilution is one in which a unit volume of a liquid material of interest is combined with an appropriate volume of a solvent liquid to achieve the desired concentration. The dilution factor is the total number of unit volumes in which the material is dissolved. The diluted material must then be thoroughly mixed to achieve the true dilution. For example, a 1:5 dilution (verbalize as "1 to 5" dilution) entails combining 1 unit volume of solute (the material to be diluted) with (approximately) 4 unit volumes of the solvent to give 5 units of total volume. (Some solutions and mixtures take up slightly less volume than their components.)',
 u'The dilution factor is frequently expressed using exponents: 1:5 would be 5e\u22121 (5\u22121 i.e. one-fifth:one); 1:100 would be 10e\u22122 (10\u22122 i.e. one hundredth:one), and so on.',
 u'There is often confusion between dilution ratio (1:n meaning 1 part solute to n parts solvent) and dilution factor (1:n+1) where the second number (n+1) represents the total volume of solute + solvent. In scientific and serial dilutions, the given ratio (or factor) often means the ratio to the final volume, not to just the solvent. The factors then can easily be multiplied to give an overall dilution factor.',
 u'In other areas of science such as pharmacy, and in non-scientific usage, a dilution is normally given as a plain ratio of solvent to solute.',
 u'Irrational ratios.',
 u'Some ratios are between incommensurable quantities\u2014quantities whose ratio is an irrational number. The earliest discovered example, found by the Pythagoreans, is the ratio of the diagonal to the side of a square, which is the square root of 2.',
 u"The ratio of a circle's circumference to its diameter is called pi, and is not only irrational but also transcendental.",
 u'Another well-known example is the golden ratio, which is defined as both sides of the equality "a:b" = ("a+b"):"a". Writing this in fractional terms as formula_6 and finding the positive solution gives the golden ratio formula_7 which is irrational. Thus at least one of "a" and "b" has to be irrational for them to be in the golden ratio. An example of an occurrence of the golden ratio is as the limiting value of the ratio of two successive Fibonacci numbers: even though the "n"-th such ratio is the ratio of two integers and hence is rational, the limit of the sequence of these ratios as "n" goes to infinity is the irrational golden ratio.',
 u'Similarly, the silver ratio is defined as both sides of the equality "a:b" = (2"a+b"):"a". Again writing it in fractional terms and obtaining the positive solution, we obtain formula_8 which is irrational, so of two quantities "a" and "b" in the silver ratio at least one of them must be irrational.',
 u'Odds.',
 u'"Odds" (as in gambling) are expressed as a ratio. For example, odds of "7 to 3 against" (7:3) mean that there are seven chances that the event will not happen to every three chances that it will happen. The probability of success is 30%. In every ten trials, there are expected to be three wins and seven losses.',
 u'Different units.',
 u'Ratios are unitless when they relate quantities in units of the same dimension.',
 u'For example, the ratio 1 minute : 40 seconds can be reduced by changing the first value to 60 seconds. Once the units are the same, they can be omitted, and the ratio can be reduced to 3:2.',
 u'In chemistry, mass concentration "ratios" are usually expressed as w/v percentages, and are really proportions.',
 u'For example, a concentration of 3% w/v usually means 3g of substance in every 100mL of solution. This cannot easily be converted to a pure ratio because of density considerations, and the second figure is the "total" amount, not the volume of solvent.',
 u'Financial ratios.',
 u'Various financial ratios are used in the fundamental analysis of a business, for example the price\u2013earnings ratio is commonly quoted for shares.',
 u'Triangular coordinates.',
 u'The locations of points relative to a triangle with vertices "A", "B", and "C" and sides "AB", "BC", and "CA" are often expressed in extended ratio form as "triangular coordinates". ',
 u'In barycentric coordinates, a point with coordinates formula_9 is the point upon which a weightless sheet of metal in the shape and size of the triangle would exactly balance if weights were put on the vertices, with the ratio of the weights at "A" and "B" being formula_10 the ratio of the weights at "B" and "C" being formula_11 and therefore the ratio of weights at "A" and "C" being formula_12',
 u'In trilinear coordinates, a point with coordinates "x:y:z" has perpendicular distances to side "BC" (across from vertex "A") and side "CA" (across from vertex "B") in the ratio "x:y", distances to side "CA" and side "AB" (across from "C") in the ratio "y:z", and therefore distances to sides "BC" and "AB" in the ratio "x:z".',
 u'Since all information is expressed in terms of ratios (the individual numbers denoted by formula_13 "x, y," and "z" have no meaning by themselves), a triangle analysis using barycentric or trilinear coordinates applies regardless of the size of the triangle.']

We will attempt to quantify the difference by calculating the term frequencies for words in Simple vs. English Wikipedia, and looking at the difference between them.
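
As a small illustration of the measure, the sketch below computes relative term frequencies for two toy token lists and subtracts them in the direction used later in this notebook (English minus Simple), so a negative value marks a word that is relatively more common in Simple Wikipedia. The token lists and the helper name `relative_tf` are made up for illustration; the real pipeline also tokenizes and lemmatizes with NLTK.

```python
from collections import Counter


def relative_tf(tokens):
    """Map each token to its count divided by the total token count."""
    counts = Counter(tokens)
    total = float(sum(counts.values()))
    return {term: n / total for term, n in counts.items()}


# Toy data, not taken from either wiki
simple_tokens = ['a', 'number', 'is', 'a', 'thing']
english_tokens = ['the', 'number', 'of', 'the', 'set', 'is', 'the', 'number']

simple_tf = relative_tf(simple_tokens)
english_tf = relative_tf(english_tokens)

# English minus Simple; a term missing from one corpus has frequency 0 there
vocabulary = set(simple_tf) | set(english_tf)
difference = {t: english_tf.get(t, 0.0) - simple_tf.get(t, 0.0) for t in vocabulary}

# Most Simple-leaning terms first
for term, diff in sorted(difference.items(), key=lambda kv: (kv[1], kv[0])):
    print('%-8s %+.3f' % (term, diff))
```

Because each corpus is normalised by its own token count, the two frequency dictionaries each sum to 1 and can be compared even though the corpora differ greatly in size.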


In [27]:
def text_dict_to_term_dict(d):
    '''Transform the text document dictionary into a dictionary of
    relative term frequencies by lowercasing, tokenizing, lemmatizing,
    and keeping only lemmas that are entirely alphabetical. Returns the
    frequency dictionary and the total number of tokens counted.'''
    lemmatizer = nltk.WordNetLemmatizer()
    term_matrix = defaultdict(int)
    all_counts = 0
    for title in d:
        for paragraph in d[title]:
            # Tokenize lowercase words
            tokens = nltk.word_tokenize(paragraph.lower())
            
            # Lemmatize words
            lemmas = map(lemmatizer.lemmatize, tokens)
            for lem in lemmas:
                
                # Remove non-alphabetical tokens
                if lem.isalpha():
                    term_matrix[lem] += 1
                    all_counts += 1
    
    for x in term_matrix:
        term_matrix[x] /= float(all_counts)
    
    return term_matrix, all_counts

In [28]:
# This takes several seconds
sd_term_matrix, sd_count = text_dict_to_term_dict(sd)
md_term_matrix, md_count = text_dict_to_term_dict(md)

We expect the Simple Wikipedia vocabulary to contain fewer unique words than the English Wikipedia vocabulary, if only because the English Wikipedia articles contain far more words overall.


In [29]:
len(sd_term_matrix), len(md_term_matrix), sd_count, md_count


Out[29]:
(3859, 12412, 49242, 386471)
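
One hedged way to put these four numbers on a comparable footing is the type-token ratio (unique lemmas divided by total tokens). This is an added illustration rather than part of the original analysis. Because vocabulary typically grows more slowly than corpus size, the ratio is itself sensitive to corpus length and should be read only as a rough indication.

```python
# Counts copied from Out[29] above
sd_vocab, md_vocab, sd_tokens, md_tokens = 3859, 12412, 49242, 386471

sd_ttr = sd_vocab / float(sd_tokens)       # Simple Wikipedia type-token ratio
md_ttr = md_vocab / float(md_tokens)       # English Wikipedia type-token ratio
size_ratio = md_tokens / float(sd_tokens)  # how much larger the English corpus is
vocab_ratio = md_vocab / float(sd_vocab)   # how much larger its vocabulary is

print('type-token ratio: simple=%.3f english=%.3f' % (sd_ttr, md_ttr))
print('English has %.1fx the tokens but only %.1fx the vocabulary'
      % (size_ratio, vocab_ratio))
```

The English corpus has roughly 7.8 times as many tokens but only about 3.2 times as many unique lemmas, which is consistent with the expectation stated above.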

In [9]:
sd_terms = set(sd_term_matrix)
md_terms = set(md_term_matrix)

In [10]:
term_difference = {}

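# Find differences in the term frequencies of the two corpora: English minus Simple,
# so a negative value means the term is relatively more frequent in Simple Wikipedia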
for term in md_terms.union(sd_terms):
    term_difference[term] = md_term_matrix[term] - sd_term_matrix[term]

In [11]:
sorted_term_difference = sorted(term_difference.items(), key=lambda x: x[1])

2525 of the 3859 words that show up in Simple Wikipedia have larger term frequencies there than in English Wikipedia (that is, a negative difference). One possible reason is that English Wikipedia has more content and therefore more unique words, which spreads its frequency mass over a larger vocabulary. Another possible reason is that, because Simple Wikipedia aims for simple language, its editors rely on the most frequent words more often.


In [12]:
len([x for x in sorted_term_difference if x[1] < 0])


Out[12]:
2525
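
To help separate the two explanations, one could check how many of these terms are absent from English Wikipedia altogether and how many occur in both corpora but are relatively more frequent in Simple. The sketch below uses toy frequency dictionaries and a hypothetical helper, `split_simple_leaning`; with the real data, the same function could be called with `sd_term_matrix` and `md_term_matrix`.

```python
def split_simple_leaning(simple_tf, english_tf):
    """Split terms that are more frequent in Simple into two groups.

    Returns (only_in_simple, in_both): terms that never occur in the
    English corpus, and terms that occur in both corpora but have a
    higher relative frequency in Simple. A frequency of 0 counts as
    absent, which also handles defaultdicts that gained zero entries.
    """
    only_in_simple, in_both = [], []
    for term, s_freq in simple_tf.items():
        e_freq = english_tf.get(term, 0.0)
        if s_freq <= e_freq:
            continue  # not Simple-leaning
        if e_freq == 0:
            only_in_simple.append(term)
        else:
            in_both.append(term)
    return sorted(only_in_simple), sorted(in_both)


# Toy frequencies, not taken from the notebook's data
toy_simple = {'thing': 0.2, 'number': 0.5, 'the': 0.3}
toy_english = {'number': 0.3, 'the': 0.4, 'manifold': 0.3}
only_simple, both = split_simple_leaning(toy_simple, toy_english)
print('only in Simple: %s, in both: %s' % (only_simple, both))
```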

In [13]:
term_difference_df = pd.DataFrame(sorted_term_difference, columns=['term', 'term_difference'])
term_difference_df['en_tf'] = term_difference_df.term.apply(lambda x: md_term_matrix[x])
term_difference_df['simple_tf'] = term_difference_df.term.apply(lambda x: sd_term_matrix[x])

Words that show up relatively more often in Simple Wikipedia than in English Wikipedia may be either non-technical words that domain experts rarely use, or circumlocutions for explaining math in simple terms. Looking at the top twenty of these words, we see concepts like "number", "rounding", and "mathematics", as well as informal words like "you" and "thing". More surprising is that words like "is", "it", "or", "we", and "are" show up. Perhaps professional mathematicians have alternative ways of expressing relationships other than forms of "to be", and prefer explicit variable names to pronouns.


In [18]:
# Words most characteristic of Simple Wikipedia's math category pages
term_difference_df.head(20)


Out[18]:
           term  term_difference     en_tf  simple_tf
0            is        -0.008253  0.028403   0.036656
1            it        -0.004921  0.008076   0.012997
2        number        -0.004722  0.008153   0.012875
3            or        -0.003697  0.005868   0.009565
4            we        -0.002280  0.001599   0.003879
5           are        -0.002103  0.009716   0.011819
6          this        -0.002081  0.006733   0.008814
7      rounding        -0.001862  0.000331   0.002193
8   mathematics        -0.001822  0.002179   0.004001
9          will        -0.001750  0.001296   0.003046
10        thing        -0.001622  0.000145   0.001767
11          you        -0.001572  0.000215   0.001787
12          how        -0.001520  0.000572   0.002092
13          way        -0.001481  0.001038   0.002518
14         used        -0.001475  0.002525   0.004001
15         they        -0.001405  0.001418   0.002823
16         same        -0.001325  0.001457   0.002782
17        would        -0.001313  0.000900   0.002214
18          can        -0.001251  0.006303   0.007555
19      example        -0.001250  0.003705   0.004955

The top 20 words that distinguish English Wikipedia from Simple Wikipedia fall into three groups: variable names, common technical terms, and function words (conjunctions, prepositions, and articles) that hold mathematical language together. Within the subset of 229 articles we examined, x is the variable name most over-represented in English Wikipedia, followed by n, then f (a function name), p (often a probability), b, and k. Note that this is term frequency and not document frequency, so these values may be inflated by variables being repeated many times within a single document during a derivation.

Technical words like "function", "space", "theory", "group", "distribution", and "manifold" are much more common in higher-level maths.

I can easily see how "such" and "by" are characteristic of technical math, since many proofs contain phrases like "by the X theorem..." or "such that Y holds". I'm not sure how to explain "of" or "the", but from glancing through a few proofs, it looks like they are often used as glue words.


In [19]:
# Words most characteristic of English Wikipedia's math category pages
term_difference_df.tail(20)[::-1]


Out[19]:
               term  term_difference     en_tf  simple_tf
12778            of         0.006827  0.044518   0.037691
12777           the         0.004993  0.074182   0.069189
12776             x         0.003223  0.005335   0.002112
12775           and         0.002756  0.022840   0.020084
12774             n         0.002700  0.004528   0.001828
12773            in         0.002212  0.022439   0.020227
12772            by         0.002091  0.008021   0.005930
12771             f         0.001692  0.002342   0.000650
12770      function         0.001451  0.005858   0.004407
12769         space         0.001416  0.002533   0.001117
12768        theory         0.001387  0.003133   0.001746
12767             p         0.001339  0.001785   0.000447
12766          such         0.001068  0.003180   0.002112
12765             b         0.001053  0.002230   0.001178
12764         group         0.000863  0.001736   0.000873
12763            on         0.000850  0.005480   0.004630
12762          from         0.000845  0.003871   0.003026
12761             k         0.000832  0.001035   0.000203
12760  distribution         0.000824  0.001027   0.000203
12759      manifold         0.000812  0.001076   0.000264
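
Following up on the note above about term frequency versus document frequency, the sketch below counts the number of articles in which a term appears at least once. It assumes the same structure as `sd` and `md` (a title mapped to a list of paragraph strings), but uses a plain ASCII regular-expression tokenizer instead of the NLTK tokenizer and lemmatizer, so its counts would not line up exactly with the term frequencies above. The function name `document_frequency` and the toy corpus are illustrative.

```python
import re
from collections import Counter


def document_frequency(docs):
    """Count, for each lowercase ASCII-alphabetic token, the articles containing it."""
    doc_freq = Counter()
    for paragraphs in docs.values():
        seen = set()
        for paragraph in paragraphs:
            seen.update(re.findall(r'[a-z]+', paragraph.lower()))
        doc_freq.update(seen)  # each article contributes at most 1 per term
    return doc_freq


# Toy corpus, not real Wikipedia text
toy_docs = {
    'A': ['Let x be a number.', 'Then x plus x is even.'],
    'B': ['A ratio compares one number to another number.'],
}
doc_freq = document_frequency(toy_docs)
print('x: %d, number: %d' % (doc_freq['x'], doc_freq['number']))
```

In the toy corpus, "x" occurs three times but only in one article, so its document frequency is 1, while "number" appears in both articles. A variable that dominates a single long derivation would show the same pattern in the real data.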

In [34]:
term_difference_df.to_csv('term_differences.csv', index=False, encoding='utf8')